Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#798 - PackageReivsion deletionProposed sometimes deletes the packageRevision #99

Merged
merged 4 commits into from
Sep 5, 2024

Conversation

kushnaidu
Copy link
Contributor

@kushnaidu kushnaidu commented Aug 29, 2024

resolves nephio-project/nephio#798
This change sends a notification of packageRevision change before creating packageRev to avoid the race condition of an invalid ownerRef uid for packageRev, which causes k8s to add a deletionTimestamp to the packageRevs.

@kushnaidu
Copy link
Contributor Author

/retest

@kispaljr
Copy link
Collaborator

kispaljr commented Aug 29, 2024

I may be wrong, but it seems to me that by deleting those lines from the code you removed the following functionality from porch:
The internal metadata store CR (kind: PackageRev) doesn't have an ownerReference to its parent PackageRevision object any more. That ownerReference exists mainly to let the garbage collector know that if the PackageRevision is deleted it can also delete the PackageRev.

Could you elaborate on the reasons and implications of removing this functionality?

@kushnaidu
Copy link
Contributor Author

kushnaidu commented Aug 29, 2024

I may be wrong, but it seems to me that by deleting those lines from the code you removed the following functionality from porch: The internal metadata store CR (kind: PackageRev) doesn't have an ownerReference to its parent PackageRevision object any more. That ownerReference exists mainly to let the garbage collector know that if the PackageRevision is deleted it can also delete the PackageRev.

Could you elaborate on the reasons and implications of removing this functionality?

Hey @kispaljr, thanks for your feedback.
The issue is that after an approve, when the 'main' twin gets created, intermittently, k8s adds the deletionTimestamp to it.
The suspect seemed to be the ownerRefs. Sometimes k8s sees the ownerRefs before the pkgrev is created and it looks at the pkgrev as an orphan. So upon a propose-delete, porch checks if the pkgRev has a deletionTimestamp and if it does, it removes the finalizer and calls for a delete of the pkgrev.
Initially, I didn't remove the ownerRefs, I created the pkgrev without the ownerRefs and then called for an update(adding the ownerRefs) but that didn't help as well.

@kispaljr
Copy link
Collaborator

Thank you for the clarification. I think I understand it much better now.
Can we find a solution that keeps the ownerReference in the PackageRev pointing to the PackageRevision, but avoid the race condition at the same time?

@kushnaidu
Copy link
Contributor Author

In the latest commit, I've reverted the removal of the ownerRefs and changed the flow in repo.go to send a notification of the packagerevision change first before creation of the packageRev.

Copy link
Member

@liamfallon liamfallon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

/approve

@nephio-prow nephio-prow bot added the approved label Sep 4, 2024
Copy link
Collaborator

@efiacor efiacor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit pick

pkg/cache/repository.go Outdated Show resolved Hide resolved
pkg/cache/repository.go Outdated Show resolved Hide resolved
@Catalin-Stratulat-Ericsson
Copy link
Contributor

/approve

1 similar comment
@efiacor
Copy link
Collaborator

efiacor commented Sep 4, 2024

/approve

@liamfallon
Copy link
Member

@kispaljr does the later changes address your concerns?

@kispaljr
Copy link
Collaborator

kispaljr commented Sep 5, 2024

sorry, I haven't got the time to review the new version till now.
The current implementation is definitely an improvement, so I support merging it.

I still cannot see however, how this change guarantees that the parent PackageRevision exists in the porch server by the time the PackageRev is created in the etcd. Please note, that I am not saying that it doesn't guarantee it, I just don't see why, and I am afraid that it only decreases the probability of the race condition to happen by delaying the creation of PackageRevs with the time it takes to send out the notifications.
I guess all I am trying to say is that an explanation of how this solves the problem should be put in a comment in the code, so that developers would understand it easier.

@kispaljr
Copy link
Collaborator

kispaljr commented Sep 5, 2024

Since as I stated there is no doubt that this is an improvement, and the solution was tested so we are sure that in practice it solves the original issue, please ignore my theoretical thoughts above and let's merge this.

/lgtm
/approve

Copy link
Contributor

nephio-prow bot commented Sep 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Catalin-Stratulat-Ericsson, efiacor, kispaljr, kushnaidu, liamfallon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kushnaidu
Copy link
Contributor Author

Merge

Hey Istvan,
I tested this change with around 150-200 package revisions and couldn't reproduce the issue even once. When the cached repo is updated, it gives enough time for the pkgRev to use a valid uid I think.

@kispaljr kispaljr merged commit 8a1c5fd into nephio-project:main Sep 5, 2024
5 checks passed
@nagygergo
Copy link
Contributor

nagygergo commented Sep 8, 2024

Yeah, the race condition doesn't fully disappear. If the kube-controller-manager is processing through notifications from porch much slower than processing through notifiactions from kube-apiserver's pkgrev watch, the same issue can happen.

But this can happen between any 2 resources, regardless of which component serves it. I'll check whether there is a safeguard against this in the garbage collector.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PackageReivsion deletionProposed sometimes deletes the packageRevision.
6 participants